1 Linear regression

1.1 Import data

In this workshop, we will use simulated climate data based on real data from the Bureau of Meteorology at Canterbury Racecourse AWS {station 066194} collected in 2023. The simulated data contains several different daily measurements throughout Autumn (March-May).

We will use the following variables.

  • X3pm.temperature (daily temperature measured at 3 pm)
  • Maximum.temperature (maximum daily temperature)
  • X9am.temperature (daily temperature measured at 9 am)
  • Minimum.temperature (minimum daily temperature)

The temperature data are measured in Celsius. Please beware that the variable names are case sensitive.

Task 1: Download the data file AutumnCleaned.csv in your data folder within your STAT5002 folder. Then import the csv file into a variable called data:

### write your code here. 
data = read.csv("data/AutumnCleaned.csv", header = T)
### the following displays the dimension of the data
dim(data)
## [1] 92 16
names(data)
##  [1] "X"                              "Minimum.temperature"           
##  [3] "Maximum.temperature"            "Rainfall"                      
##  [5] "Evaporation"                    "Sunshine..hours."              
##  [7] "Direction.of.maximum.wind.gust" "Speed.of.maximum.wind.gust"    
##  [9] "X9am.temperature"               "X9am.relative.humidity"        
## [11] "X9am.wind.direction"            "X9am.wind.speed"               
## [13] "X3pm.temperature"               "X3pm.relative.humidity"        
## [15] "X3pm.wind.direction"            "X3pm.wind.speed"

Task 2: How many observations are there? How many variables are there?

Answer: There are 92 observations and 16 variables.

### This is for internal use/solution only
T3pm = data$X3pm.temperature
T9am = data$X9am.temperature
Tmax = data$Maximum.temperature
Tmin = data$Minimum.temperature

1.2 Scatter plot

Generate a scatter plot for the the daily temperature observed at 3 pm (X3pm.temperature) and the observed daily maximum temperature (Maximum.temperature), with X3pm.temperature (\(X\)) on the horizontal axis and Maximum.temperature (\(Y\)) on the vertical axis.

Generate another scatter plot for the daily temperature observed at 9 am (X9am.temperature) and the observed daily minimum temperature (Minimum.temperature), with X9am.temperature (\(X\)) on the horizontal axis and Minimum.temperature (\(Y\)) on the vertical axis.

Comment on and compare these associations based on the scatter plot.

### Write your code here
###
par(mfrow=c(1,2))
plot(T3pm, Tmax, xlab="X: 3pm temperature", ylab="Y: max temperature")
plot(T9am, Tmin, xlab="X: 9am temperature", ylab="Y: min temperature")

Answer: There is a strong positive association between X3pm.temperature and Maximum.temperature. There is also a positive association between X9am.temperature and Minimum.temperature. The association between X3pm.temperature and Maximum.temperature is stronger than that between X9am.temperature and Minimum.temperature.

1.3 Linear model

Using the function lm(), build a linear model for predicting daily maximum temperature (Maximum.temperature) given the temperature observed at 3 pm (X3pm.temperature). Generate a scatter plot for X3pm.temperature and Maximum.temperature. Plot the resulting regression line on top of the scatter plot using abline(). Predict the value of daily maximum temperature given a value of \(X=33\) for the temperature observed at 3 pm.

Use the function points(X, Y, col="red", cex=3, pch=19) to plot the predict value \(Y\) (together with the predictor \(X\)), where the options col="red", cex=3, pch=19 specify the color, the marker size, and the mark type, respectively.

### Write your code here
l = lm(Tmax~T3pm)
plot(T3pm, Tmax, xlab="X: 3pm temperature", ylab="Y: max temperature")
abline(l, col="blue")
X = 33
Y = l$coefficients[1] + l$coefficients[2]*X
Y
## (Intercept) 
##     34.7929
points(X,Y, col="red", cex=3, pch=19)

# or the alternative, not taught in class
pred = predict(l, data.frame(T3pm = c(33)))
pred
##       1 
## 34.7929

1.4 Residual plot

Generate the residual plot of the linear model built above. Comment on if the regression line is a good fit.

### Write your code here
plot(T3pm, l$residuals,  xlab="X: 3pm temperature", ylab="Y: residual")

Answer: Here the key is to understand that the changing variance of the residual (homoskedasticity or heteroskedasticity) should be used to justify a linear model. Here are some ideas. In this case, the regression line is a reasonable fit as the residual plot does not show a significant trend of changing variances. Note that although the data has changing density over X, but the variation of the residual remain constant.

1.5 Coefficient of determination

Compared to the baseline prediction, what percentage of variation in the response variable Maximum.temperature can be explained by the linear regression model fitted above.

### Write your code here
cor(T3pm, Tmax)^2
## [1] 0.8957635

Answer: Compared to the baseline prediction, about 89.58% variations in the response variable can be explained by the linear regression model.

1.6 Verification of the coefficient of determination

Recall that the coefficient of determination of the regression line is \[ 1 - \frac{\text{SSE}}{\text{SST}}, \quad \text{SSE} = \sum_{i = 1}^n (y_i - a - b x_i)^2, \quad \text{SST} = \sum_{i = 1}^n (y_i - \bar{y})^2 \] where \(a\) and \(b\) are the intercept and the slope of the regression line, respectively. We want to verify that the coefficient of determination can be indeed be written as squared correlation coefficient (\(r^2\)) in R.

Task 1: Compute the sum of squared residuals/errors (SSE) of the fitted linear model and the total sum of squares of the dependent variable X3pm.temperature.

### Write your code here
SSE = sum(l$residuals^2)
SST = sum((Tmax - mean(Tmax))^2)

Task 2: Compute the coefficient of determination using the above formula, and compare it with \(r^2\), do they agree?

### Write your code here
1 - SSE/SST
## [1] 0.8957635
cor(Tmax,T3pm)^2
## [1] 0.8957635

2 Simulating chance

In an experiment, three fair dice are thrown independently. What is the chance of getting a total equal to 6?

Write R code to simulate 1000 experiments and report your result.

You should compare your result with other students, what do you observe?

set.seed(23)
m = 1000
totals=sample(1:6, m, rep = T)+sample(1:6, m, rep = T)+sample(1:6, m, rep = T)
table(totals)/m
## totals
##     3     4     5     6     7     8     9    10    11    12    13    14    15 
## 0.007 0.011 0.025 0.048 0.066 0.105 0.112 0.135 0.134 0.113 0.094 0.065 0.041 
##    16    17    18 
## 0.029 0.013 0.002
barplot(table(totals), main="1000 rolls: sum of 3 dice")

Answer: the chance of getting a total equal to 6 is approximately 0.046. Your results may differ. We will learn why we observe such variations next week.